number theory
Mathematical AI helps researchers crack 50-year-old problem
Just a week after an AI disproved an 80-year-old conjecture and astonished mathematicians, another conjecture that had stood for half a century has fallen to a proof inspired by the same techniques, but this time written entirely by humans. Last week, an unreleased AI model from OpenAI disproved an important conjecture first posed by Hungarian mathematician Paul Erdős, called the unit distance problem. The puzzle, which Erdős considered his "most striking contribution to geometry" and which many mathematicians had failed to unravel, concerns the number of similar-sized connections you can make between dots arranged on a flat surface. Erdős had set a ceiling on this number, which many experts had assumed was correct. But the AI model showed that this number could in fact be much larger, using an obscure trick from algebraic number theory to make complex structures with extremely high dimensions, which could then be used to place the dots in a very different arrangement from any that humans had considered.
Mathematicians stunned by AI's biggest breakthrough in mathematics yet
An 80-year-old maths conjecture that has eluded the world's greatest mathematicians has been cracked by an artificial intelligence model built by OpenAI. The result has stunned experts and is being hailed as a seismic moment for AI's mathematical ability. "This is a problem that I didn't expect to see solved in my lifetime," says Misha Rudnev at the University of Bristol, UK. "It's absolutely a bomb." Tim Gowers at the University of Cambridge wrote that the solution is "a milestone in AI mathematics" in a blog post accompanying the work. "If a human had written the paper and submitted it to the and I had been asked for a quick opinion, I would have recommended acceptance without any hesitation. No previous AI-generated proof has come close to that."
A New AI Math Startup Just Cracked 4 Previously Unsolved Problems
Axiom says its AI found solutions to several long-standing math problems, a sign of the technology's steadily advancing reasoning capabilities. Five years ago, mathematicians Dawei Chen and Quentin Gendron were trying to untangle a difficult area of algebraic geometry involving differentials, elements of calculus used to measure distance along curved surfaces. While working on one theorem, they ran into an unexpected roadblock: Their argument depended on a strange formula from number theory, but they were unable to solve or justify it. In the end, Chen and Gendron wrote a paper presenting their idea as a conjecture, rather than a theorem. Chen recently spent hours prompting ChatGPT in the hopes of getting the AI to come up with a solution to the still unsolved problem, but it wasn't working.
Amateur mathematicians solve long-standing maths problems with AI
Amateur mathematicians are using artificial intelligence chatbots to solve long-standing problems, in a move that has taken professionals by surprise. While the problems in question aren't the most advanced in the mathematical canon, the success of AI models in tackling them shows that their mathematical performance has passed a significant threshold, say researchers, and could fundamentally change the way we do mathematics. The questions being solved by AI originate from Hungarian mathematician Paul Erdős, who was famous for his ability to pose useful but difficult questions during a career that spanned over six decades. "The questions tended to be very simple, but very hard," says Thomas Bloom at the University of Manchester, UK. By his death in 1996, there were more than 1000 of these unsolved Erdős problems, spanning a wide range of mathematical disciplines, from combinatorics (the study of combinations) to number theory.
Skill-Targeted Adaptive Training
He, Yinghui, Panigrahi, Abhishek, Lin, Yong, Arora, Sanjeev
Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often they failed to apply each skill in their responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.
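The reweighting idea behind STAT-Sel can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the skill labels, and the additive weighting rule are all hypothetical, and it assumes the teacher has already produced, for each incorrect student answer, the list of skills that were missing.

```python
from collections import Counter


def missing_skill_profile(missing_skills_per_failure):
    """Count how often each skill was missing across the student's wrong answers.

    `missing_skills_per_failure` is a list with one entry per incorrect answer;
    each entry is the list of skills the teacher judged to be missing (assumed input).
    """
    profile = Counter()
    for skills in missing_skills_per_failure:
        profile.update(skills)
    return profile


def reweight_examples(example_skills, profile, base_weight=1.0):
    """Return normalized sampling weights for training examples.

    An example's raw weight is `base_weight` plus the profile counts of the
    skills it requires (a hypothetical rule chosen for illustration).
    """
    raw = {
        example_id: base_weight + sum(profile[skill] for skill in skills)
        for example_id, skills in example_skills.items()
    }
    total = sum(raw.values())
    return {example_id: weight / total for example_id, weight in raw.items()}


if __name__ == "__main__":
    profile = missing_skill_profile(
        [["modular_arithmetic", "factoring"], ["modular_arithmetic"]]
    )
    weights = reweight_examples(
        {"ex1": ["modular_arithmetic"], "ex2": ["geometry"]}, profile
    )
    print(profile, weights)
```

Examples that exercise frequently missed skills receive a larger share of the sampling mass, while examples on skills the student already applies keep only the base weight.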
SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
Wei, Hu, Xu, Ze, Yang, Boyu, Miao, Linlin, Zhai, Weiqi, Li, Yihan, Li, Zixuan, Wang, Zhijun, Wang, Boya, Yu, Jianwei, Yuan, Jialing, Zhang, Xiaoyue, He, Cheng, Chen, Minglei, Zhang, Zifan, Li, Qianhui, Wang, Wei, Xu, Xiang
Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject x model and grade x model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.
Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning
This paper introduces the Primender sequence, a novel integer sequence defined by a hybrid rule that combines classical primality with modular digit-based conditions. Specifically, a number n is included in the sequence if it is prime or if it ends with a prime number of any length, from a single unit digit upward. In other words, the sequence consists of the numbers that are prime or have at least one prime suffix. The resulting sequence exhibits a deterministic yet non-trivial structure, blending number-theoretic properties with symbolic patterning. We propose the Primender sequence as a benchmark for evaluating the symbolic reasoning capabilities of Large Language Models (LLMs). The study is motivated by the need for interpretable, rule-based testbeds that can assess an LLM's ability to infer hidden rules, validate mathematical hypotheses, and generalize symbolic logic at scale. A key hypothesis explored is: Whenever a number in the Primender sequence is exactly one more than the largest prime less than or equal to it, the difference between it and the previous number in the sequence is also 1. We design a structured prompt and evaluation framework to test this hypothesis across multiple state-of-the-art LLMs, including ChatGPT, Copilot, DeepSeek, Gemini, Grok, and LLaMA. The models are tasked with identifying the underlying rule, validating the hypothesis, and generating the next 100,000 terms of the sequence. Comparative metrics such as rule inference accuracy, hypothesis evaluation, sequence validity, and symbolic explanation quality are used to assess model performance. This work contributes a novel mathematical construct and a reproducible methodology for benchmarking LLMs in symbolic reasoning, hypothesis testing, and scalable pattern generalization - bridging the domains of number theory, artificial intelligence, and software engineering.
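The inclusion rule can be sketched in a few lines of Python. This is one reading of the abstract, not the paper's reference code: the function names are hypothetical, and a "suffix" is taken to be n modulo a power of ten, so a suffix with leading zeros has the same value as a shorter one.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, adequate for small n."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0:
        return False
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def is_primender(n: int) -> bool:
    """True if n is prime or some decimal suffix of n is prime (assumed rule)."""
    modulus = 10
    while True:
        if is_prime(n % modulus):
            return True
        if modulus > n:  # the last suffix checked was n itself
            return False
        modulus *= 10


def primender_terms(limit: int) -> list[int]:
    """All terms of the sequence up to and including `limit`."""
    return [n for n in range(2, limit + 1) if is_primender(n)]


def hypothesis_holds(limit: int) -> bool:
    """Check the stated hypothesis for every term up to `limit`."""
    terms = primender_terms(limit)
    for previous, current in zip(terms, terms[1:]):
        # "One more than the largest prime <= current" means that current is
        # composite and current - 1 is prime.
        if not is_prime(current) and is_prime(current - 1):
            if current - previous != 1:
                return False
    return True


if __name__ == "__main__":
    print(primender_terms(30))
    print(hypothesis_holds(10_000))
```

Under this reading the hypothesis follows directly from the definition: if n is one more than a prime, that prime is itself a term of the sequence, so the gap to the previous term is 1.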
Primality Testing via Circulant Matrix Eigenvalue Structure: A Novel Approach Using Cyclotomic Field Theory
This paper presents a novel primality test based on the eigenvalue structure of circulant matrices constructed from roots of unity. We prove that an integer $n > 2$ is prime if and only if the minimal polynomial of the circulant matrix $C_n = W_n + W_n^2$ has exactly two irreducible factors over $\mathbb{Q}$. This characterization connects cyclotomic field theory with matrix algebra, providing both theoretical insights and practical applications. We demonstrate that the eigenvalue patterns of these matrices reveal fundamental distinctions between prime and composite numbers, leading to a deterministic primality test. Our approach leverages the relationship between primitive roots of unity, Galois theory, and the factorization of cyclotomic polynomials. We provide comprehensive experimental validation across various ranges of integers, discuss practical implementation considerations, and analyze the computational complexity of our method in comparison with established primality tests. The visual interpretation of our mathematical framework provides intuitive understanding of the algebraic structures that distinguish prime numbers. Our experimental validation demonstrates that our approach offers a deterministic alternative to existing methods, with performance characteristics reflecting its algebraic foundations.
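The stated criterion can be explored numerically with a short Python sketch. It is not the paper's algorithm, and it rests on two assumptions: that $W_n$ has the $n$-th roots of unity $\zeta^k$ as eigenvalues (as the cyclic shift matrix does), so that $C_n$ has eigenvalues $\zeta^k + \zeta^{2k}$, and that $C_n$ is diagonalizable, so that the irreducible factors of its minimal polynomial over $\mathbb{Q}$ correspond to the Galois orbits $\{\zeta^{uk} + \zeta^{2uk} : \gcd(u, n) = 1\}$ of its eigenvalues. The function names are hypothetical, and floating-point rounding stands in for exact arithmetic.

```python
import cmath
import math


def eigenvalue_key(n: int, k: int) -> tuple[float, float]:
    """Rounded eigenvalue zeta^k + zeta^(2k) of C_n, with zeta = exp(2*pi*i/n)."""
    def root(exponent: int) -> complex:
        return cmath.rect(1.0, 2 * math.pi * (exponent % n) / n)

    value = root(k) + root(2 * k)
    # Rounding lets numerically equal eigenvalues compare as equal.
    return (round(value.real, 9), round(value.imag, 9))


def count_irreducible_factors(n: int) -> int:
    """Count distinct Galois orbits of the eigenvalues of C_n."""
    units = [u for u in range(1, n) if math.gcd(u, n) == 1]
    orbits = {
        frozenset(eigenvalue_key(n, u * k) for u in units)
        for k in range(n)
    }
    return len(orbits)


def is_prime_by_circulant(n: int) -> bool:
    """Apply the two-factor criterion for n > 2 (numerical sketch only)."""
    if n <= 2:
        raise ValueError("the criterion is stated for n > 2")
    return count_irreducible_factors(n) == 2


if __name__ == "__main__":
    print([n for n in range(3, 30) if is_prime_by_circulant(n)])
```

The sketch costs roughly $n \cdot \varphi(n)$ eigenvalue evaluations, so it illustrates the characterization rather than competing with established primality tests.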
Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving
Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration slightly improved accuracy (+0.8%) for the model llama3.1:8b, it also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Despite using controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
Improving Math Problem Solving in Large Language Models Through Categorization and Strategy Tailoring
In this paper, we explore how to leverage large language models (LLMs) to solve mathematical problems efficiently and accurately. Specifically, we demonstrate the effectiveness of classifying problems into distinct categories and employing category-specific problem-solving strategies to improve the mathematical performance of LLMs. We design a simple yet intuitive machine learning model for problem categorization and show that its accuracy can be significantly enhanced through the development of well-curated training datasets. Additionally, we find that the performance of this simple model approaches that of state-of-the-art (SOTA) models for categorization. Moreover, the accuracy of SOTA models also benefits from the use of improved training data. Finally, we assess the advantages of using category-specific strategies when prompting LLMs and observe significantly better performance compared to non-tailored approaches.